Detecting Hidden Passages in Documents

Authors

  • Saket S.R. Mengle
  • Nazli Goharian
Abstract

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. We present our methodology to detect such hidden passages within a document. A document is divided into passages using various document splitting techniques, and a text classifier is used to classify each passage. Our detection rate, as shown empirically, is 76%, with equivalent precision. We provide a comparison of various passage identification methods and also evaluate the effects of passage length and feature selection on this process.
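The split-then-classify idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the splitting techniques or the classifier, so everything here is an assumption made for illustration: overlapping fixed-size word windows stand in for the splitting step, a small multinomial Naive Bayes model stands in for the text classifier, and the names `split_into_windows`, `NaiveBayes`, and `detect_hidden_passages` are hypothetical.

```python
import math
from collections import Counter


def split_into_windows(text, size=8, step=4):
    """Split text into overlapping word windows (assumed splitting scheme)."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) <= size:
        return [words]
    windows = [words[i:i + size] for i in range(0, len(words) - size + 1, step)]
    # Add a final window when the stride does not reach the end of the text.
    if (len(words) - size) % step != 0:
        windows.append(words[-size:])
    return windows


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed classifier)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, words):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[c] / total_docs)
            for w in words:
                score += math.log(
                    (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best, best_score = c, score
        return best


def detect_hidden_passages(text, clf, disallowed, size=8, step=4):
    """Return the windows that the classifier assigns to the disallowed class."""
    return [
        " ".join(window)
        for window in split_into_windows(text, size, step)
        if clf.predict(window) == disallowed
    ]
```

In this sketch the window size and stride play the role of passage length, which the abstract reports as one of the factors evaluated. The toy classifier uses the full vocabulary and applies no feature selection.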

Similar resources

Passage detection using text classification

Passages can be hidden within a text to circumvent their disallowed transfer. Such release of compartmentalized information is of concern to all corporate and governmental organizations. Passage retrieval is well studied; we posit, however, that passage detection is not. Passage retrieval is the determination of the degree of relevance of blocks of text, namely passages, comprising a document. ...

Automatic External Plagiarism Detection Using Passage Similarities - Lab Report for PAN at CLEF 2010

In this paper, we report our approach in detecting external plagiarism. For the pre-processing stage, we identify non-English documents and translate them into English using an online translator tool. Then we index and retrieve the top documents that are similar to the suspicious documents. We divide the retrieved documents into passages where each passage contains twenty sentences. The plagiar...

Extracting Relevant Snippets for Web Navigation

Search engines present fixed-length passages from documents ranked by relevance against the query. In this paper, we present and compare novel, language-model based methods for extracting variable length document snippets by real-time processing of documents using the query issued by the user. With this extra level of information, the returned snippets are considerably more informative. Unlike pr...

Detecting Short Passages of Similar Text in Large Document Collections

This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents each text is associated with a fingerprint which should be different from all the others. If fingerprints are too close then it is suspected that passages of copied or similar text occur in two documents. Our method exploits the characteristic distribution of word trigrams,...

HMM-based Passage Models for Document Classification and Ranking

We present an application of Hidden Markov Models to supervised document classification and ranking. We consider a family of models that take into account the fact that relevant documents may contain irrelevant passages; the originality of the model is that it does not explicitly segment documents but rather considers all possible segmentations in its final score. This model generalizes the mul...

Publication date: 2008